Outline

1. introduction
   • motivation
   • process of mining data
2. features
   • visualisation
3. exploration
   • statistics
   • clustering
4. Association Rule Mining
   • algorithm
   • tool
   • example
Google Query Suggestion

Find similar words with more hits.

[Screenshot: a Google web search for a misspelled query returns the suggestion "Did you mean: mathematics".]
Amazon Recommender System

Item shown: Holy Bible, King James Version

Customers who bought this book also bought:

• Holy Bible King James Version Study Bible (Burgundy) by Not Applicable (Na)
• The Holy Quran: An English Translation by Allamah Nooruddin
• The Torah by Rodney, Rabbi Mariner
• The Qur'an Translation by Abdullah Yusuf Ali
• The Holy Bible: King James Version by Not Applicable (Na)

Association Rule: A <- B
process of mining data

first of all: define your objective
then:

1. data collection
2. feature extraction
3. data cleaning
4. exploration: summaries, clustering
5. rule mining and/or classification
types of attributes

very simple world view:

binary   true, false; present, not present
nominal  blue, red, green
ordinal  drizzle < rain < torrent
numeric  4.45, 5.76, 19.33
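
The four attribute types map fairly directly onto R's basic data types. The following is a minimal sketch; the data frame `weather` and its column names are made up for illustration and do not come from the talk.

```r
# Hypothetical observations, one column per attribute type.
weather <- data.frame(
  present = c(TRUE, FALSE, TRUE),                      # binary
  colour  = factor(c("blue", "red", "green")),         # nominal
  rain    = factor(c("rain", "drizzle", "torrent"),
                   levels = c("drizzle", "rain", "torrent"),
                   ordered = TRUE),                    # ordinal
  amount  = c(4.45, 5.76, 19.33)                       # numeric
)

# Order comparisons are defined for ordinal (and numeric) attributes,
# but not for nominal ones.
below_torrent <- weather$rain < "torrent"
heaviest      <- max(weather$rain)
```

Declaring the level order explicitly matters: without `levels`, R would sort the levels alphabetically, which is rarely the intended ordinal scale.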
data mining on eMail:

• bag of words
• length of the mail (number of words)
• number of recipients
• date: epoch, week number, daytime, ...
• ...
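
As a rough illustration, the features listed above could be extracted from a single mail like this. The `mail` structure, its field names and its contents are assumptions made for the sketch; a real mail would first have to be parsed from its headers and body.

```r
# Hypothetical mail; field names and contents are made up.
mail <- list(
  to   = c("alice@example.org", "bob@example.org"),
  date = as.POSIXct("2005-12-28 14:30:00", tz = "UTC"),
  body = "Data mining is fun and mining data is useful"
)

# Split the body into lower-case words.
words <- strsplit(tolower(mail$body), "[^a-z]+")[[1]]
words <- words[words != ""]

features <- list(
  bag_of_words = table(words),                          # word -> count
  length       = length(words),                         # number of words
  recipients   = length(mail$to),                       # number of recipients
  epoch        = as.numeric(mail$date),                 # seconds since 1970-01-01
  week         = as.integer(format(mail$date, "%V")),   # ISO week number
  hour         = as.integer(format(mail$date, "%H"))    # daytime
)
```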
• many, infrequently occurring features (words)
• one word, many meanings
• one meaning, many words
• -> extensive preprocessing necessary
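
A minimal preprocessing sketch, assuming a tiny hand-written stop-word list and a simple frequency cut-off. Both `stopwords` and `min_count` are illustrative choices, and the sketch does nothing about words with several meanings or several words with one meaning.

```r
# Assumed stop-word list, far too short for real use.
stopwords <- c("the", "a", "of", "and", "is", "re", "fwd")

preprocess <- function(text, min_count = 1) {
  # normalise case and split on anything that is not a letter or digit
  words <- unlist(strsplit(tolower(text), "[^a-z0-9]+"))
  # drop empty strings and words that carry no information
  words <- words[nchar(words) > 0 & !(words %in% stopwords)]
  counts <- table(words)
  # drop infrequently occurring words
  counts[counts >= min_count]
}

subjects <- c("Re: The Status of IPv6", "Fwd: IPv6 status update", "IPv6 and DHCP")
counts <- preprocess(subjects, min_count = 2)
```

With these three subject lines only `ipv6` and `status` survive the cut-off.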
• aggregating more of the same
example:

• joining different feature spaces

example: pgp signature data and event data -> who met whom at which key signing party

> DAYS <- data.frame(day=c("Monday", "Tuesday", ...), num=c(1, 2, ...))
> SCHEDULE <- data.frame(SPK=c("Sven", "Mitch", ...), daynum=c(2, 2, ...))
> merge(DAYS, SCHEDULE, by.x="num", by.y="daynum")
  num     day   SPK
1   2 Tuesday  Sven
2   2 Tuesday Mitch
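
Both operations can be tried on a small, fully spelled-out data set. This is only a sketch: the split into `week1` and `week2` and the speaker "Ingo" are made up, and `rbind` is one plausible reading of "aggregating more of the same".

```r
# Hypothetical data: two batches of the same kind of record ...
week1 <- data.frame(SPK = c("Sven", "Mitch"), daynum = c(2, 2))
week2 <- data.frame(SPK = c("Ingo"), daynum = c(1))

# ... are aggregated by stacking their rows.
SCHEDULE <- rbind(week1, week2)

# A second feature space, keyed by the day number ...
DAYS <- data.frame(day = c("Monday", "Tuesday"), num = c(1, 2))

# ... is joined on the shared key. The key column keeps the name
# it has in the first argument.
JOINED <- merge(DAYS, SCHEDULE, by.x = "num", by.y = "daynum")
```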
simple whitespace separated table


label 1  label 2  label 3
1        3        2
1        2        5
2        3        3
7        3        5
4        8        9
2        5

labels  are  optional 
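
Such a table can be read into R with `read.table`. The sketch below reads from an inline string so that it is self-contained; the contents and the one-word column names are made up. For a file, the first argument would be the file name instead of `text =`.

```r
# Hypothetical table contents; the first line holds the labels.
txt <- "label1 label2 label3
1 3 2
1 2 5
2 3 3"

with_labels <- read.table(text = txt, header = TRUE)

# Without a label line, R names the columns V1, V2, ...
without_labels <- read.table(text = "1 3 2\n1 2 5\n2 3 3", header = FALSE)
```

Labels must not contain whitespace themselves, otherwise each word is taken as a separate column name.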
R is a language and environment for statistical computing and graphics.

http://www.r-project.org/

FreeBSD: /usr/ports/math/R/
Debian:
> data <- c(2, 2, 3, 3, 5, 5, 5, 6, 6, 7)
> data
2 2 3 3 5 5 5 6 6 7
> range(data)
2 7
> mean(data)
4.4
> median(data)
5
> summary(data)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
   2.00    3.00    5.00    4.40    5.75    7.00
> data
2 2 3 3 5 5 5 6 6 7
> var(data)
3.155556
> duta <- c(5, 5, 3, 6, 7, 9, 7, 4, 2, 3)
> cov(data, duta)
-0.6
> cor(data, duta)
-0.1547056
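
To see what these functions compute, the same numbers can be reproduced from the textbook definitions. The variable names ending in `_by_hand` are introduced only for this sketch.

```r
data <- c(2, 2, 3, 3, 5, 5, 5, 6, 6, 7)
duta <- c(5, 5, 3, 6, 7, 9, 7, 4, 2, 3)
n <- length(data)

# sample variance: summed squared deviations from the mean, divided by n - 1
var_by_hand <- sum((data - mean(data))^2) / (n - 1)

# sample covariance: summed products of the paired deviations, divided by n - 1
cov_by_hand <- sum((data - mean(data)) * (duta - mean(duta))) / (n - 1)

# Pearson correlation: covariance scaled by both standard deviations
cor_by_hand <- cov_by_hand / (sd(data) * sd(duta))
```

The division by n - 1 rather than n is why `var(data)` is 3.155556 and not 2.84.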
Idea

eMails with similar subject lines are about similar topics

for each list
1. get all subject lines
2. for all words: count how often the word occurs in the subject lines
3. remove words that carry no information from the lists

compare the lists of the word counts -> clustering
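
The steps above can be sketched in base R as follows. The subject lines, the list names and the `noise` word list are made up, and Euclidean distance with `hclust` is just one possible way to compare the word counts; the talk does not say which method was used.

```r
# Hypothetical subject lines for four made-up lists.
subjects <- list(
  dhc   = c("dhcp option draft", "dhcp relay option"),
  dhcv6 = c("dhcp option for ipv6", "dhcp relay draft"),
  ipv6  = c("ipv6 address format", "ipv6 routing header"),
  ipng  = c("ipv6 address scope", "routing header draft ipv6")
)
noise <- c("for", "the", "re")   # words assumed to carry no information

# steps 1-3: word counts per list, without the noise words
count_words <- function(lines) {
  words <- unlist(strsplit(tolower(lines), "[^a-z0-9]+"))
  table(words[nchar(words) > 0 & !(words %in% noise)])
}
counts <- lapply(subjects, count_words)

# one row per list, one column per word of the joint vocabulary
vocab <- sort(unique(unlist(lapply(counts, names))))
m <- t(sapply(counts, function(tab) {
  v <- setNames(numeric(length(vocab)), vocab)
  v[names(tab)] <- as.numeric(tab)
  v
}))

# compare the count vectors and cluster hierarchically
tree   <- hclust(dist(m))
groups <- cutree(tree, k = 2)
```

`plot(tree)` would draw the resulting dendrogram.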
example

[Figure: example plot; content not recoverable.]

Association Rule Mining

Idea


mailing lists that share many writers are somehow related

item         mailing list
transaction  all the mailing lists someone writes to within a week

we used the mailing list archive of the IETF
• 171 items (mailing lists)
• 2084 transactions (writers who write to two different mailing lists within a week)


I.  Lutkebohle,  J.  Luning         Applied  Data  Mining 


Association  Rule  Mining 


association rule:

dhc <- dhcwg dhcipv6 (10.9, 99.6)

support     proportion of transactions which contain all items from the rule
confidence  accuracy: proportion of the transactions containing the right part of the rule that also contain the left part of the rule
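
Both measures are easy to compute directly from these definitions. In the sketch below the five transactions are made up; `head` and `body` are names chosen here for the left and right part of a rule written as `head <- body`.

```r
# Hypothetical transactions: the lists one writer posted to within a week.
transactions <- list(
  c("dhc", "dhcwg", "dhcipv6"),
  c("dhc", "dhcwg"),
  c("dhcwg", "dhcipv6"),
  c("ipngwg", "ipv6"),
  c("ipngwg", "ipv6")
)

# Proportion of transactions that contain every item in `items`.
support <- function(items) {
  mean(sapply(transactions, function(tr) all(items %in% tr)))
}

# Rule "head <- body": of the transactions containing the body,
# the proportion that also contain the head.
confidence <- function(head, body) {
  support(c(head, body)) / support(body)
}

s  <- support(c("dhc", "dhcwg", "dhcipv6"))       # 1 of 5 transactions
cf <- confidence("dhc", c("dhcwg", "dhcipv6"))    # 1 of the 2 containing the body
```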
• itemsets with enough support are called frequent
• each subset of a frequent itemset has to be frequent
• so the algorithm starts with small itemsets, checks if they are frequent and goes on to supersets of frequent itemsets
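
The level-wise search described above can be sketched in a few lines of R. This is a simplified illustration, not the implementation used in the talk: the transactions are made up, candidates are built as unions of two frequent itemsets, and the usual extra pruning step (checking every subset of a candidate) is replaced by simply counting the support of each candidate.

```r
# Hypothetical transactions.
transactions <- list(
  c("dhc", "dhcwg", "dhcipv6"),
  c("dhc", "dhcwg", "dhcipv6"),
  c("dhc", "dhcwg"),
  c("ipngwg", "ipv6"),
  c("ipngwg", "ipv6"),
  c("imrg")
)

support <- function(items) {
  mean(sapply(transactions, function(tr) all(items %in% tr)))
}

frequent_itemsets <- function(min_support) {
  is_frequent <- function(s) support(s) >= min_support
  items  <- sort(unique(unlist(transactions)))
  level  <- Filter(is_frequent, as.list(items))   # frequent 1-itemsets
  result <- list()
  while (length(level) > 0) {
    result <- c(result, level)
    # candidates: unions of two frequent k-itemsets that have k + 1 items
    candidates <- list()
    for (i in seq_along(level)) {
      for (j in seq_along(level)) {
        if (i < j) {
          u <- sort(union(level[[i]], level[[j]]))
          if (length(u) == length(level[[i]]) + 1) {
            candidates[[paste(u, collapse = " ")]] <- u   # name removes duplicates
          }
        }
      }
    }
    level <- Filter(is_frequent, unname(candidates))
  }
  result
}

fi <- frequent_itemsets(0.3)
```

Because every subset of a frequent itemset is frequent, each frequent itemset with k + 1 items is the union of two frequent k-itemsets, so building candidates only from the previous level does not miss anything.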
Apriori implementation by Christian Borgelt

http://fuzzy.cs.uni-magdeburg.de/~borgelt/apriori.html
imrg asrg-announce
ipngwg ipv6
ipngwg ipv6
atommib rohc
ipngwg ipv6


./apriori -s2 -c90 writers rules.rul


dhc <- dhcwg (11.1, 97.8)
dhcwg <- dhc (11.5, 95.0)
dhcipv6 <- dhcwg (11.1, 98.3)
dhcwg <- dhcipv6 (11.6, 94.6)
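
For further exploration, rule lines of this shape can be loaded back into R. The sketch assumes that the two numbers in parentheses are support and confidence in percent, in that order, which fits the `-s2 -c90` thresholds but is an assumption here; the column names are chosen for illustration.

```r
# The four rule lines shown above.
rule_lines <- c(
  "dhc <- dhcwg (11.1, 97.8)",
  "dhcwg <- dhc (11.5, 95.0)",
  "dhcipv6 <- dhcwg (11.1, 98.3)",
  "dhcwg <- dhcipv6 (11.6, 94.6)"
)

# left part, right part and the numbers in parentheses
head_part <- sub(" <- .*$", "", rule_lines)
body_part <- trimws(sub("\\(.*$", "", sub("^.* <- ", "", rule_lines)))
numbers   <- strsplit(sub("^.*\\((.*)\\)$", "\\1", rule_lines), ",")

rules <- data.frame(
  head       = head_part,
  body       = body_part,
  support    = as.numeric(sapply(numbers, `[`, 1)),   # assumed: support in percent
  confidence = as.numeric(sapply(numbers, `[`, 2)),   # assumed: confidence in percent
  stringsAsFactors = FALSE
)

# strongest rules first
rules <- rules[order(-rules$confidence), ]
```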